Aishwarya Girish

Contents

Report

1. Introduction

Developmental bifurcations represent critical transitions in which cells or organisms diverge into distinct phenotypic fates. The freshwater planarian Phagocata morgani (P. morgani), undergoes such a bifurcation, developing into either an asexual or sexual reproductive phenotype, each associated with a distinct growth trajectory. However, the mechanisms that govern these divergent outcomes, particularly the specific cell types that drive commitment toward either fate, remain poorly understood.

This notebook presents a quantitative framework for identifying cell-type-specific drivers of sexual and asexual growth trajectories in P. morgani. A particular focus is placed on identifying cell-type behaviour around the bifurcation points as they are regions in the developmental manifold where fate decisions are likely to occur. By statistically evaluating these transitions, candidate cellular populations that may act as intermediates or determinants of reproductive fate are identified.

The notebook is organized to guide the reader through each step of this analysis: from data preprocessing and pseudotime trajectory inference to cell-type abundance modeling and statistical enrichment testing. It serves as both a computational workflow and an interpretative tool for understanding the cellular dynamics in planaria.

1.1. Background

Development in P. morgani proceeds along distinct trajectories that give rise to sexual and asexual phenotypes. These trajectories are shaped by both environmental cues and intrinsic organismal properties, such as body size. To begin disentangling these factors, worms were maintained at a constant temperature of 17.5℃ — a condition known to sustain a mixed population of sexual and asexual individuals. This temperature was specifically chosen to allow for the separation of size-dependent effects from temperature-driven variation in cellular differentiation dynamics.

A single-cell atlas was constructed from this population by isolating nuclei from individual worms. Each worm was flash frozen and processed independently, with one worm per well in a 96-well plate. Prior to freezing, a video recording was captured for each worm, enabling quantification of morphological features such as body area and length using a standardized image analysis pipeline.

The resulting dataset was processed into a gene expression matrix and subsequently converted into a cell-type abundance matrix, where each row represents an individual worm and each column corresponds to a broad cell type. To ensure analytical robustness, cell types with extremely low abundance (<50 cells) across all samples were filtered out. Worm-level cell counts were then aggregated based on assigned cell-type annotations, yielding a worm-by-cell-type matrix reflecting absolute cell composition.

In parallel, morphological measurements (specifically worm length and area) were extracted for each sample to serve as covariates in downstream modeling. The raw count matrix was normalized using Monocle3’s size factor estimation framework to account for differences in total nuclei recovered per worm. This step corrected for variability in sequencing depth and overall cell recovery, enabling meaningful comparisons of cell-type proportions across worms.

The resulting size-normalized matrix forms the basis for all subsequent analyses.

1.2. Getting Started

To ensure reproducibility and transparency, all analyses were performed within an RNotebook framework using R (version 4.4.2). The Monocle3 package (Trapnell lab) was utilized for single-cell trajectory inference and normalization. Dependencies including Bioconductor packages and common data manipulation and visualization libraries were loaded as detailed below.

1.2.1. Installation and Setup
# Package Installation
required_packages <- c(
  # Core analysis
  "dplyr", "tidyr", "data.table", "ggplot2", "tibble",
  # Visualization
  "plotly", "pheatmap", "htmlwidgets", "gridExtra", "dendextend", "readr",
  # Data processing
  "VGAM", "zoo", "broom", "reshape2", "purrr", "splines"
)

# Install CRAN packages
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) {
  install.packages(new_packages, dependencies = TRUE)
}

# Install Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

bioc_packages <- c(
  'BiocGenerics', 'DelayedArray', 'DelayedMatrixStats',
  'limma', 'lme4', 'S4Vectors', 'SingleCellExperiment',
  'SummarizedExperiment', 'batchelor', 'HDF5Array',
  'terra', 'ggrastr'
)

new_bioc <- bioc_packages[!(bioc_packages %in% installed.packages()[,"Package"])]
if(length(new_bioc)) {
  BiocManager::install(new_bioc)
}

# Install GitHub packages
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

github_packages <- c(
  'cole-trapnell-lab/monocle3',
  'pln-team/PLNmodels',
  'cole-trapnell-lab/hooke'
)

# Check which GitHub packages need installation
if (!requireNamespace("monocle3", quietly = TRUE)) {
  devtools::install_github(github_packages[1])
}
if (!requireNamespace("PLNmodels", quietly = TRUE)) {
  remotes::install_github(github_packages[2])
}
if (!requireNamespace("hooke", quietly = TRUE)) {
  devtools::install_github(github_packages[3])
}

# Load all packages
all_packages <- c(required_packages, "monocle3", "PLNmodels", "hooke")

invisible(lapply(all_packages, function(pkg) {
  if (!require(pkg, character.only = TRUE)) {
    warning(paste("Package", pkg, "failed to load"))
  }
}))

# Verify loading
loaded <- sapply(all_packages, require, character.only = TRUE, quietly = TRUE)
if(all(loaded)) {
  message("All packages loaded successfully")
} else {
  warning("Failed to load: ", paste(names(loaded)[!loaded], collapse = ", "))
}

# # Print session info
# message("\nSession info:")
# sessionInfo()
1.2.2. Set Working Directory

The working directory was set to organize data and results appropriately.


setwd("/home/aishwarya/R/dorrity_lab_embl") #enter your working directory

2. Trajectory Inference Using Monocle3

To elucidate growth progression within the single-cell size atlas, trajectory inference was conducted using Monocle3. Principal component analysis (PCA) embeddings, specifically the first three principal components, were utilized in place of default UMAP embeddings to better represent the global structure of the data.

(Note: Principal Component Analysis (PCA) was selected over Uniform Manifold Approximation and Projection (UMAP) embeddings for trajectory inference due to PCA’s ability to preserve global data structure and variance. Unlike UMAP, which emphasizes local neighborhood relationships and nonlinear manifold learning primarily for visualization, PCA provides a linear projection that retains the largest sources of variation across the entire dataset. This characteristic is crucial when modeling developmental trajectories, as it ensures that the principal axes represent meaningful biological gradients—such as size and differentiation status—rather than focusing solely on local clustering. Using PCA embeddings thus enables more robust and interpretable inference of pseudotemporal progression in complex single-cell data.)

Given that morphological size in P. morgani correlates strongly with developmental stage, worm length (Lengthmm) was used as a biologically informed proxy for pseudotime, here referred to as pseudosize. Cells were clustered at high resolution to resolve fine-grained cell states, and a principal graph was constructed across the PCA spaceto model putative differentiation trajectories. Pseudotime values were then inferred along this graph, capturing size-associated developmental progression across individual worms. This approach facilitated the characterization and visualization of divergent cellular trajectories underlying sexual and asexual differentiation.


# Load preprocessed cell_data_set object
cds <- readRDS("size-cds-cellcounts_3e-5-highRes-cell-type.rds")
# Replace default UMAP embeddings with first three PCA components for dimensionality reduction
cds@int_colData@listData$reducedDims@listData$UMAP <- cds@int_colData@listData$reducedDims@listData$PCA[, 1:3]

# Cluster cells at a low resolution for fine granularity
cds <- cluster_cells(cds, resolution = 3e-5)
plot_cells_3d(cds, color_cells_by="Lengthmm")

# Learn principal graph to model developmental trajectories
cds <- learn_graph(cds, learn_graph_control=list(ncenter=7, prune_graph=FALSE, minimal_branch_len=1))
plot_cells_3d(cds, color_cells_by="Lengthmm", cell_size = 30)

# Order cells along the learned trajectory (interactive mode)
#cds <- order_cells(cds)

# Order Cells (NOTE: The root node was initially found in the interactive mode, but here the input is automated for a smooth knitting of the final HTML document.)

root_node <- "Y_6"  

cds <- order_cells(cds, root_pr_nodes = root_node)

# Assign pseudotime values reflecting developmental progression correlated with worm size
cds$pseudosize_pca <- pseudotime(cds)
monocle3_trajectory_plot <- plot_cells_3d(cds, color_cells_by="pseudosize_pca")
monocle3_trajectory_plot


# Extract and arrange metadata by pseudotime for downstream analysis
coldat <- as.data.frame(colData(cds))
coldat <- coldat %>% arrange(pseudosize_pca)

rowdat <- as.data.frame(rowData(cds))

# Extract and reorder counts matrix according to sample order in metadata
cell_counts <- as.matrix(counts(cds))[, coldat$sample]


# Calculate variance explained by PCA components to assess dimensionality reduction
var_explained <- cds@reduce_dim_aux$PCA@listData$model$prop_var_expl
var_explained                                # vector for every PC
 [1] 0.35782603 0.17105484 0.12641509 0.08478567 0.07126308 0.04854720 0.04694388 0.03448373 0.03137934 0.02730115
sum(var_explained[1:3])                      # cumulative for PC1–PC3
[1] 0.655296
sum(var_explained[1:7])                      # cumulative for PC1–PC7
[1] 0.9068358
if (!dir.exists("trajectory_output")) dir.create("trajectory_output")

# Save interactive 3D trajectory plot as an HTML widget
saveWidget(monocle3_trajectory_plot, "trajectory_output/monocle3_trajectory_plot.html", selfcontained = TRUE)

2.1. Dimensionality and Clustering

To determine the appropriate number of clusters in the PCA space, unsupervised k-means clustering was performed on the first three principal components of the cell type composition data. The working hypothesis was that incorporating more principal components (thereby capturing a greater proportion of variance) would yield finer-grained clusters and improve resolution. However, this hypothesis was not supported by the data.

Contrary to expectations, increasing the number of principal components (e.g., 7 PCs to capture ~80–90% of the variance) resulted in a decrease in the number of biologically meaningful clusters. This observation suggests that the additional components may predominantly capture technical noise or low-variance structure, thereby diluting the signal necessary for robust clustering. In contrast, the first three components, which collectively explain approximately 66% of the variance, retained the most relevant biological structure for delineating distinct cell states.

2.1.1. K-Means Elbow Plot

To assess cluster structure, a standard elbow plot was constructed by computing the total within-cluster sum of squares (WSS) for a range of k values. The maximum number of clusters evaluated was set to k = 15 to allow for detection of an inflection point in the within-cluster sum of squares (WSS) curve. This upper limit was chosen to exceed the anticipated number of biologically relevant clusters while avoiding excessive partitioning of the data. The WSS curve exhibited a distinct elbow, beyond which increases in k yielded minimal gains, indicating the optimal number of clusters. A distinct inflection point in the WSS curve indicated the optimal number of clusters (k=4), balancing model complexity and explanatory power.


# Needs ggplot2 and Monocle3 which were already loaded earlier
library(purrr)

# Extract PCA coordinates (first 3 PCs)
pca_coords <- reducedDims(cds)$PCA[, 1:3]  # Adjust to 1:3 if using 3D

# Calculate WSS for k=1 to k=15
set.seed(123)  # Reproducibility
max_k <- 15
wss <- map_dbl(1:max_k, ~{
  kmeans(pca_coords, centers = .x, nstart = 25, iter.max = 50)$tot.withinss
})

# Plot ELbow Curve

elbow_plot <- ggplot(data.frame(k = 1:max_k, WSS = wss), aes(x = k, y = WSS)) +
  geom_line(color = "#377EB8", linewidth = 1.2) +
  geom_point(color = "#E41A1C", size = 3) +
  labs(
    title = "K-means Elbow Plot (PCA Space)",
    x = "Number of clusters (k)",
    y = "Total Within-Cluster Sum of Squares (WSS)"
  ) +
  scale_x_continuous(breaks = 1:max_k) +
  theme_minimal() +
  theme(
    text = element_text(family = "Helvetica"),
    plot.title = element_text(face = "bold", hjust = 0.5),
    panel.grid.minor = element_blank(),
    
    # Solid white background for panel
    panel.background = element_rect(fill = "white", color = NA),
    # Add border around the plot panel
    panel.border = element_rect(color = "black", fill = NA, linewidth = 0.5),
    
    # Optional: add border around entire plot (including margin)
    plot.background = element_rect(color = "black", fill = "white", linewidth = 0.5)
  )

print(elbow_plot)

# Save plot
ggsave("trajectory_output/elbow_plot.png", plot = elbow_plot, width = 8, height = 6, dpi = 300)

NA
NA
2.1.2. K-Means Clustering

To identify discrete cellular states along the developmental trajectory, K-means clustering was performed on cells embedded in PCA space and ordered by pseudotime. All cells were initially assigned to a common early-state label (Prebranch_1). Cells exceeding a defined pseudotime threshold were considered post-branch and partitioned into two major clusters: Branch_1 and Branch_2.

An intermediate cluster, Prebranch_2, emerged within the transitional pseudotime window, representing a putative pre-commitment state preceding bifurcation.Importantly, clustering into three groups revealed a branch point along the growth trajectory that precedes the divergence into asexual and sexual pathways (Branch_1 and Branch_2 respectively). This early inflection is primarily represented by samples assigned to the Prebranch_2 cluster. The distinct positioning of this group suggests that a critical shift in cellular composition or developmental state may occur upstream of overt sexual differentiation. This observation raises the possibility that Prebranch_2 corresponds to a transitional phase characterized by an early commitment or upregulation of developmental programs.

These data-driven clusters delineate key stages of growth in P. morgani, with Branch_1 and Branch_2 corresponding to the committed asexual and sexual fates, respectively. Visualization in 3D PCA space confirmed that the inferred clusters recapitulate the underlying structure of developmental divergence.

Cluster Description Color
Prebranch_1 Early developmental state (common root) Gray
Prebranch_2 Transitional pre-commitment state Orange
Branch_1 Terminal state of the asexual trajectory Blue
Branch_2 Terminal state of the sexual trajectory Red

# Initialize all cells as prebranch_1
cds$branch <- "Prebranch_1"

#vIdentify post-branch cells using pseudotime threshold
pseudotime_threshold <- quantile(cds$pseudosize_pca, probs = 0.33)
post_branch_idx <- which(cds$pseudosize_pca > pseudotime_threshold)

# Branching logic (Main bifurcation: split post-branch cells into two clusters)
if(length(post_branch_idx) > 10) {
  post_coords <- reducedDims(cds)$PCA[post_branch_idx, 1:3]
  set.seed(123)
  km <- kmeans(post_coords, centers = 2)
  cds$branch[post_branch_idx] <- ifelse(km$cluster == 1, "Branch_1", "Branch_2")
  
  # Secondary clustering (Secondary bifurcation: split Branch_1 into Prebranch_2 and Branch_1)
  branch1_idx <- which(cds$branch == "Branch_1")
  if(length(branch1_idx) > 5) {
    branch1_coords <- reducedDims(cds)$PCA[branch1_idx, 1:3]
    set.seed(123)
    km2 <- kmeans(branch1_coords, centers = 2)
    cds$branch[branch1_idx] <- ifelse(km2$cluster == 1, "Branch_1", "Prebranch_2")
  }
}

# Define and order final cluster labels
cds$branch <- factor(cds$branch, 
                     levels = c("Prebranch_1", "Prebranch_2", "Branch_1", "Branch_2"))

names(cds$branch) <- colnames(cds)

# Color scheme
branch_colors <- c(
  "Prebranch_1" = "#999999",
  "Branch_1" = "#377EB8",
  "Prebranch_2" = "orange",
  "Branch_2" = "#E41A1C"
)


# 3D scatter plot of clusters 
clustering_plot_4 <- plot_ly()

for(branch in levels(cds$branch)) {
  if(any(cds$branch == branch)) {
    clustering_plot_4 <- clustering_plot_4 %>% add_trace(
      x = reducedDims(cds)$PCA[cds$branch == branch, 1],
      y = reducedDims(cds)$PCA[cds$branch == branch, 2],
      z = reducedDims(cds)$PCA[cds$branch == branch, 3],
      type = "scatter3d", mode = "markers",
      marker = list(
        size = ifelse(branch == "Prebranch_1", 4, 4),
        color = branch_colors[branch]
      ),
      name = branch
    )
  }
}

clustering_plot_4 <- clustering_plot_4 %>% 
  config(autosizable = TRUE) %>% 
  layout(
    scene = list(
      xaxis = list(title = "PC1"),
      yaxis = list(title = "PC2"),
      zaxis = list(title = "PC3")
    ),
    legend = list(
      orientation = "h", 
      x = 0.5, 
      xanchor = "center", 
      y = -0.15,  # Try more negative values if needed
      yanchor = "top"
    )
  )

clustering_plot_4

# Save plot
saveWidget(clustering_plot_4, "trajectory_output/clustering_plot_4.html", selfcontained = TRUE)

#NOTE: Another idea would be to try the Leiden clustering. Since Monocle3 already does that in the learn_graph() step, the k-means clustering (as illustrated above) should suffice.

2.2. Defining Trajectories

To enable focused downstream analyses, four biologically relevant trajectory subsets were delineated, each representing specific stages or fate decisions within P. morgani growth:

  • Early Development: comprising Prebranch_1 and Prebranch_2, representing initial stages prior to major fate commitment.

  • Asexual Trajectory: encompassing Prebranch_1, Prebranch_2, and Branch_1, corresponding to the developmental path leading to asexual worms.

  • Sexual Trajectory: including Prebranch_1, Prebranch_2, and Branch_2, representing the trajectory towards sexual differentiation.

  • Late Development: consisting of Branch_1 and Branch_2, capturing the terminal phases of both asexual and sexual lineages.

These trajectory definitions provide a structured framework for systematic comparison of cell type composition, dynamic variability, and pseudotemporal progression.


## Defining Clusters and Trajectories

# Set consistent cluster ordering and associated colors
branch_order <- c("Prebranch_1", "Prebranch_2", "Branch_1", "Branch_2")
branch_colors <- c(
  "Prebranch_1" = "#999999",  # grey
  "Prebranch_2" = "orange",   # orange
  "Branch_1"    = "#377EB8",  # blue
  "Branch_2"    = "#E41A1C"   # red
)

#coldat$cluster <- cds$branch[rownames(coldat)]  

# Assign cluster labels to cells
coldat$cluster <- as.character(cds$branch)[match(rownames(coldat), names(cds$branch))]
coldat$cluster <- factor(coldat$cluster, levels = c("Prebranch_1", "Prebranch_2", "Branch_1", "Branch_2"))

# Define trajectory groupings and their branch composition
trajectories <- list(
  early_dev = list(
    name = "Early Development",
    branches = c("Prebranch_1", "Prebranch_2"),
    color = c("#999999", "orange")
  ),
  asexual = list(
    name = "Asexual Trajectory",
    branches = c("Prebranch_1", "Prebranch_2", "Branch_1"),
    color = c("#999999", "orange", "#377EB8")
  ),
  sexual = list(
    name = "Sexual Trajectory",
    branches = c("Prebranch_1", "Prebranch_2", "Branch_2"),
    color = c("#999999", "orange", "#E41A1C")
  ),
  late_dev = list(
    name = "Late Development",
    branches = c("Branch_1", "Branch_2"),
    color = c("#377EB8", "#E41A1C")
  ),
  all_clusters = list(
    name = "All Clusters",
    branches = levels(cds$branch),
    color = unname(branch_colors[levels(cds$branch)])
  )
)


# Optional diagnostic: Check how many cells fall under each trajectory
for (traj_name in names(trajectories)) {
  clusters <- trajectories[[traj_name]]$branches
  cells <- rownames(coldat)[coldat$cluster %in% clusters]
  cat(sprintf("Trajectory '%s' contains %d cells across clusters: %s\n",
              traj_name, length(cells), paste(clusters, collapse = ", ")))
}
Trajectory 'early_dev' contains 54 cells across clusters: Prebranch_1, Prebranch_2
Trajectory 'asexual' contains 71 cells across clusters: Prebranch_1, Prebranch_2, Branch_1
Trajectory 'sexual' contains 77 cells across clusters: Prebranch_1, Prebranch_2, Branch_2
Trajectory 'late_dev' contains 40 cells across clusters: Branch_1, Branch_2
Trajectory 'all_clusters' contains 94 cells across clusters: Prebranch_1, Prebranch_2, Branch_1, Branch_2
2.2.1. Defining Branch Points and Visualizing Trajectories

Branch points within the developmental trajectory were identified as the minimum pseudotime values corresponding to transitions into distinct clusters (Prebranch_2, Branch_1, Branch_2). These branch points mark critical fate decisions along the pseudotemporal axis derived from the PCA embedding.

# Ensure pseudotime is available
if ("pseudosize_pca" %in% colnames(colData(cds))) {

  # Find branch transitions
  transitions <- list(
    Prebranch_2 = min(cds$pseudosize_pca[cds$branch == "Prebranch_2"], na.rm = TRUE),
    Branch_1 = min(cds$pseudosize_pca[cds$branch == "Branch_1"], na.rm = TRUE), 
    Branch_2 = min(cds$pseudosize_pca[cds$branch == "Branch_2"], na.rm = TRUE)
  )

  # Remove any invalid transitions (Inf, NA)
  transitions <- transitions[!is.infinite(unlist(transitions)) & !is.na(unlist(transitions))]

  # Create base 3D plot
  base_plot <- plot_cells_3d(
    cds,
    color_cells_by = "branch",
    cell_size = 30,
    trajectory_graph_color = "black",
    trajectory_graph_segment_size = 3
  )

  branch_points_plot <- base_plot

  # Helper: Get PCA coordinates near pseudotime ---
  get_coords_at_pseudotime <- function(cds, pseudotime_val) {
    idx <- which.min(abs(cds$pseudosize_pca - pseudotime_val))
    reducedDims(cds)$PCA[idx, 1:3]
  }

  # Add branch points with clean labels
  for (branch in names(transitions)) {
    coords <- get_coords_at_pseudotime(cds, transitions[[branch]])

    branch_points_plot <- branch_points_plot %>% add_trace(
      x = coords[1], y = coords[2], z = coords[3],
      type = "scatter3d",
      mode = "markers",
      marker = list(
        size = 12,
        color = branch_colors[branch],
        symbol = "diamond"
      ),
      name = sprintf("%s (t=%.2f)", branch, transitions[[branch]]),
      hoverinfo = "text",
      text = sprintf("%s start\nPseudotime: %.2f\nPC1: %.1f\nPC2: %.1f\nPC3: %.1f",
                     branch, transitions[[branch]], coords[1], coords[2], coords[3]),
      showlegend = TRUE
    )
  }

  # Create density plot with ordered legend
  branch_order <- c("Prebranch_1", "Prebranch_2", "Branch_1", "Branch_2")
  plot_data <- as.data.frame(colData(cds))
  plot_data$branch <- factor(plot_data$branch, levels = branch_order)

  density_plot <- ggplot(plot_data, 
                       aes(x = pseudosize_pca, fill = branch)) +
      geom_density(alpha = 0.5) +
      geom_vline(xintercept = unlist(transitions), 
                 linetype = "dashed",
                 color = branch_colors[names(transitions)]) +
      scale_fill_manual(values = branch_colors[branch_order],
                        breaks = branch_order) +
      labs(x = "Pseudotime", y = "Density", title = "Branch Transitions") +
      theme_minimal() +
      theme(
          text = element_text(family = "Helvetica"),
          legend.position = "bottom",
          legend.justification = "center",
          plot.title = element_text(hjust = 0.5, face = "bold"),  
          panel.border = element_rect(color = "black", fill = NA, size = 0.5),  
          plot.background = element_rect(color = "black", fill = NA, linewidth = 0.5)  
    )

  # Optional: Print transition info
  cat("=== Branch Transition Pseudotimes ===\n")
  print(data.frame(
    Branch = names(transitions),
    Pseudotime = round(unlist(transitions), 2),
    PC1 = sapply(names(transitions), function(x) round(reducedDims(cds)$PCA[which.min(abs(cds$pseudosize_pca - transitions[[x]])), 1], 2)),
    PC2 = sapply(names(transitions), function(x) round(reducedDims(cds)$PCA[which.min(abs(cds$pseudosize_pca - transitions[[x]])), 2], 2)),
    PC3 = sapply(names(transitions), function(x) round(reducedDims(cds)$PCA[which.min(abs(cds$pseudosize_pca - transitions[[x]])), 3], 2))
  ))

  # Return plots
  print(branch_points_plot)
  print(density_plot)

} else {
  warning("Pseudotime column 'pseudosize_pca' not found")
  NULL
}
=== Branch Transition Pseudotimes ===
#Save plots
ggsave("trajectory_output/density_plot.png", plot = density_plot, width = 8, height = 6, dpi = 300)


saveWidget(branch_points_plot, "trajectory_output/ branch_points_plot.html", selfcontained = TRUE)

# Minimum Spanning Tree coordinates (3 rows = PC1, PC2, PC3; 7 cols = nodes) manually retrieved from Monocle3 cds object: cds@principal_graph_aux@listData[["UMAP"]][["dp_mst"]]

mst_coords_raw <- matrix(
  c(-1.002902,  1.8693896,  1.892249,  3.2705521,  4.21178119, -5.3767044, -3.54361406,
    -1.385229, -1.9476252, -2.767477,  0.5152539,  3.19226895,  1.3584236,  0.07889964,
    -2.016103, -0.8172786,  2.155795, -0.4427244,  0.01372972,  0.9358482, -0.44058576),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("PC1", "PC2", "PC3"), paste0("Y_", 1:7))
)

# Transpose so rows = nodes, cols = PCs
mst_coords <- t(mst_coords_raw)
colnames(mst_coords) <- c("PC1", "PC2", "PC3")

# MST edges manually retreived from cds object and defined (node indices, matching rownames in mst_coords)
mst_edges <- list(
  c(1,2),
  c(2,3),
  c(2,4),
  c(4,5),
  c(1,7),
  c(6,7)
)

# Optional: Define trajectories and their branches & colors (This is optional as it is already defined above)
trajectories <- list(
  early_dev = list(
    name = "Early Development",
    branches = c("Prebranch_1", "Prebranch_2"),
    color = c("#999999", "orange")
  ),
  asexual = list(
    name = "Asexual Trajectory",
    branches = c("Prebranch_1", "Prebranch_2", "Branch_1"),
    color = c("#999999", "orange", "#377EB8")
  ),
  sexual = list(
    name = "Sexual Trajectory",
    branches = c("Prebranch_1", "Prebranch_2", "Branch_2"),
    color = c("#999999", "orange", "#E41A1C")
  ),
  late_dev = list(
    name = "Late Development",
    branches = c("Branch_1", "Branch_2"),
    color = c("#377EB8", "#E41A1C")
  ),
  all_clusters = list(
    name = "All Clusters",
    branches = levels(cds$branch),
    color = unname(branch_colors[levels(cds$branch)])
  )
)

# Add sample IDs for each trajectory
for (traj_name in names(trajectories)) {
  trajectories[[traj_name]]$samples <- rownames(colData(cds))[cds$branch %in% trajectories[[traj_name]]$branches]
}

# Helper function to get coordinates at minimal pseudotime in a branch
get_coords_at_pseudotime <- function(cds, pseudotime_val) {
  idx <- which.min(abs(cds$pseudosize_pca - pseudotime_val))
  reducedDims(cds)$PCA[idx, 1:3]
}

# Branch transition pseudotime values
transitions <- list(
  Prebranch_2 = min(cds$pseudosize_pca[cds$branch == "Prebranch_2"], na.rm = TRUE),
  Branch_1 = min(cds$pseudosize_pca[cds$branch == "Branch_1"], na.rm = TRUE),
  Branch_2 = min(cds$pseudosize_pca[cds$branch == "Branch_2"], na.rm = TRUE)
)
transitions <- transitions[!is.infinite(unlist(transitions)) & !is.na(unlist(transitions))]

# Plotting loop
trajectory_plots <- lapply(trajectories, function(traj) {
  plot_colors <- branch_colors
  plot_colors[!names(plot_colors) %in% traj$branches] <- "#F0F0F0"
  
  trajectory_plot <- plot_ly() %>%
    layout(
      scene = list(
        xaxis = list(title = "PC1"),
        yaxis = list(title = "PC2"),
        zaxis = list(title = "PC3")
      ),
      title = list(text = traj$name, y = 0.95, x = 0.5, xanchor = "center"),
      legend = list(orientation = "h", x = 0.5, xanchor ="center", y = -0.1, yanchor = "top")
    )
  
  for(branch in levels(cds$branch)) {
    trajectory_plot <- trajectory_plot %>% add_trace(
      x = reducedDims(cds)$PCA[cds$branch == branch, 1],
      y = reducedDims(cds)$PCA[cds$branch == branch, 2],
      z = reducedDims(cds)$PCA[cds$branch == branch, 3],
      type = "scatter3d",
      mode = "markers",
      marker = list(
        size = ifelse(branch %in% traj$branches, 4, 3),
        color = plot_colors[branch],
        opacity = ifelse(branch %in% traj$branches, 1, 0.3)
      ),
      name = ifelse(branch %in% traj$branches, branch, paste0(branch, " (other)")),
      showlegend = branch %in% traj$branches
    )
  }
  
  for(edge in mst_edges) {
    trajectory_plot <- trajectory_plot %>% add_trace(
      x = mst_coords[edge, "PC1"],
      y = mst_coords[edge, "PC2"],
      z = mst_coords[edge, "PC3"],
      type = "scatter3d",
      mode = "lines",
      line = list(color = "black", width = 4),
      showlegend = FALSE
    )
  }
  
  for(branch in intersect(names(transitions), traj$branches)) {
    coords <- get_coords_at_pseudotime(cds, transitions[[branch]])
    trajectory_plot <- trajectory_plot %>% add_trace(
      x = coords[1], y = coords[2], z = coords[3],
      type = "scatter3d",
      mode = "markers",
      marker = list(
        size = 7,
        color = branch_colors[branch],
        symbol = "x"
      ),
      name = sprintf("%s (t=%.1f)", branch, transitions[[branch]]),
      hoverinfo = "text",
      text = sprintf("%s start\nPseudotime: %.2f", branch, transitions[[branch]]),
      showlegend = TRUE
    )
  }
  
  return(trajectory_plot)
})

# Display plots
trajectory_plots$early_dev
trajectory_plots$asexual
trajectory_plots$sexual
trajectory_plots$late_dev
trajectory_plots$all_clusters


saveWidget(trajectory_plots$early_dev, "trajectory_output/early_dev.html", selfcontained = TRUE)
saveWidget(trajectory_plots$asexual, "trajectory_output/asexual.html", selfcontained = TRUE)
saveWidget(trajectory_plots$sexual, "trajectory_output/sexual.html", selfcontained = TRUE)
saveWidget(trajectory_plots$late_dev, "trajectory_output/late_dev.html", selfcontained = TRUE)
saveWidget(trajectory_plots$all_clusters, "trajectory_output/all_clusters.html", selfcontained = TRUE)

3. Abundance Analysis

Cell type abundance patterns were analyzed along pseudotime trajectories to identify compositional changes associated with growth. The analysis focused on ±5% pseudotime windows around each branch point, comparing local cell type proportions to their global mean abundance across the corresponding trajectory. Wilcoxon rank-sum tests identified significantly enriched or depleted cell types during transitions, with three alternative hypotheses tested: enrichment, depletion, and non-directional changes (“two.sided”). Results were visualized using smoothed loess curves showing cell type proportions along pseudotime, with branch points marked by vertical lines.

Enriched cell types near branch points may represent transitional states or fate-committed progenitors. These populations often play critical roles in lineage bifurcation, either as bipotent intermediates or as sources of instructive signals. Conversely, depleted cell types suggest mutually exclusive fate choices or temporal separation of developmental programs. Cell types showing context-dependent abundance changes across different branches may indicate developmental plasticity, where populations retain the capacity for multiple fate decisions based on local conditions.

The timing of abundance peaks relative to branch points provides insights into the ordering of molecular events. Early-appearing cell types likely reflect initial fate priming, while late-appearing populations may represent terminal differentiation states. The pseudotemporal windowing approach captures these dynamics while accounting for variability in differentiation kinetics across cells. All findings were interpreted considering the compositional nature of the data, where changes in one cell type necessarily influence the interpretation of others in the ecosystem.

Technical validation confirmed the robustness of results to pseudotime reconstruction parameters and window size selection. While the analysis provides statistical identification of interesting cell populations, deeper biological interpretation requires integration with known lineage relationships and marker gene expression patterns from the broader dataset. The combined approach reveals how cellular heterogeneity is dynamically regulated during growth.

Trajectory Enriched Cell Types
Early Development
Prebranch_2 - Enrichment
Cathepsin 5, Intestine 1, Intestine 6, Cathepsin 2, Neural 5, Cathepsin 6, Parenchymal 3, Neural 8, Neural 1, Intestine 4
Asexual
Branch_1 - Enrichment
Cathepsin 1, Cathepsin 3, Neural 4, Intestine 5, Cathepsin 4, Parenchymal 6, Epidermal 5, Intestine 3, Cathepsin 8, Parenchymal 1
Sexual
Branch_2 - Enrichment
Parenchymal 5, Testes, Size specific 2, Epidermal 5, Neural 4, Size specific 1, Cathepsin 4, Cathepsin 3, Parenchymal 1, Epidermal 8
Trajectory Depleted Cell Types
Early Development
Prebranch_2 - Depletion
Size specific 1, Epidermal 4, Size specific 2, Parenchymal 5, Epidermal 3, Epidermal 8, Epidermal 2, Neural 10, Intestine 2, Testes
Asexual
Branch_1 - Depletion
Epidermal 8, Muscle 3, Muscle 2, Muscle 5, Neural 6, Parenchymal 2, Neoblast 2, Parenchymal 4, Neural 2, Parenchymal 5
Sexual
Branch_2 - Depletion
Muscle 2, Neural 1, Muscle 3, Epidermal 1, Parenchymal 3, Cathepsin 5, Neural 2, Neural 3, Intestine 2, Intestine 6


# Ensure sample names match between cell_counts and coldat
rownames(cell_counts) <- gsub("-", ".", rownames(cell_counts))  # Harmonize naming if needed
coldat <- coldat[colnames(cell_counts), ]  # Reorder coldat to match cell_counts columns

# Normalize counts to proportions (if not already normalized)
cell_props <- t(t(cell_counts) / colSums(cell_counts))  # Rows = cell types, columns = samples

#This was previously listed in another chunk, but I am running it again just for clarity
transitions <- list(
  Prebranch_2 = min(cds$pseudosize_pca[cds$branch == "Prebranch_2"], na.rm = TRUE),
  Branch_1 = min(cds$pseudosize_pca[cds$branch == "Branch_1"], na.rm = TRUE), 
  Branch_2 = min(cds$pseudosize_pca[cds$branch == "Branch_2"], na.rm = TRUE)
)


# min_prebranch2_time <- transitions$Prebranch_2
# min_branch1_time <- transitions$Branch_1
# min_branch2_time <- transitions$Branch_2

# --------------------------------------------------------------------------------------------------
# Function: analyze_abundance_for_branch
# Purpose: Identify cell types whose abundance significantly changes around a trajectory branch point.
# It computes fold-change, performs Wilcoxon tests, and visualizes smoothed abundance trends.
# --------------------------------------------------------------------------------------------------

analyze_abundance_for_branch <- function(
  cds, cell_props, coldat, trajectories, transitions,
  trajectory_key, branch_name,
  test_type = c("Enrichment", "Depletion", "Two Sided Test"),
  transition_color = "black",
  save_prefix = NULL
) {
  
    # Choose alternative hypothesis for Wilcoxon test based on user input
  wilcox_alternative <- switch(match.arg(test_type),
    "Enrichment" = "greater",
    "Depletion" = "less",
    "Two Sided Test" = "two.sided"
  )
  
  #Load necessary libraries
  library(dplyr)
  library(broom)
  library(ggplot2)
  library(reshape2)

  test_type <- match.arg(test_type)

  # Retrieve metadata for the trajectory of interest
  traj <- trajectories[[trajectory_key]]
  clusters <- traj$branches
  transition_time <- transitions[[branch_name]]
  
  # Define the pseudotime window (±5%) around the branch point
  transition_window <- 0.05 * max(coldat$pseudosize_pca, na.rm = TRUE)

  # Identify sample IDs in this trajectory
  traj_samples <- rownames(coldat)[coldat$cluster %in% clusters]

  # Compute fold-change in abundance (transition window vs global)
  compute_fc <- function(cell_type) {
    global_mean <- mean(cell_props[cell_type, traj_samples], na.rm = TRUE)
    window_samples <- rownames(coldat)[
      coldat$pseudosize_pca >= (transition_time - transition_window) &
      coldat$pseudosize_pca <= (transition_time + transition_window)
    ]
    window_mean <- mean(cell_props[cell_type, window_samples], na.rm = TRUE)
    data.frame(
      cell_type = cell_type,
      global_prop = global_mean,
      transition_prop = window_mean,
      fold_change = window_mean / global_mean
    )
  }

  # Perform Wilcoxon rank-sum test to assess enrichment/depletion
  test_enrichment <- function(cell_type) {
    in_window <- coldat$pseudosize_pca >= (transition_time - transition_window) &
                 coldat$pseudosize_pca <= (transition_time + transition_window)
    test <- wilcox.test(
      x = cell_props[cell_type, in_window],
      y = cell_props[cell_type, !in_window],
      alternative = wilcox_alternative
    )
  #   tidy(test) %>% mutate(cell_type = cell_type)
  # }
    # Create result data frame directly (no broom dependency)
    data.frame(
      cell_type = cell_type,
      p_value = test$p.value,
      statistic = test$statistic,
      method = test$method,
      stringsAsFactors = FALSE
    )
}

  # Run fold-change computation and statistical test for all cell types
  abundance_summary <- bind_rows(lapply(rownames(cell_props), compute_fc))
  enrichment_results <- bind_rows(lapply(rownames(cell_props), test_enrichment))
  
  # Merge fold-change and p-value results, adjust format
  abundance_summary <- abundance_summary %>%
    left_join(enrichment_results, by = "cell_type") %>%
    #rename(p_value = p.value) %>%
    arrange(p_value)

  # Visualize abundance trends for top 10 significant cell types
  top_celltypes <- head(abundance_summary$cell_type, 10)

  # Prepare data for plotting
  plot_data <- melt(
    data.frame(
      sample = rownames(coldat),
      pseudosize_pca = coldat$pseudosize_pca,
      t(cell_props[top_celltypes, ])
    ),
    id.vars = c("sample", "pseudosize_pca"),
    variable.name = "cell_type",
    value.name = "proportion"
  )
  
  # Generate LOESS-smoothed plot of abundance vs pseudotime
  abundance_plot <- ggplot(plot_data, aes(x = pseudosize_pca, y = proportion, color = cell_type)) +
    geom_smooth(method = "loess", se = FALSE) +
    geom_vline(xintercept = transition_time, linetype = "dashed", color = transition_color) +
    labs(
      title = paste(traj$name, ":", branch_name, "-", test_type),
      x = "Pseudotime",
      y = "Proportion",
      color = "Cell Type"
    ) +
    theme_minimal() +
  theme(
    text = element_text(family = "Helvetica"),
    plot.title = element_text(hjust = 0.5, face = "bold"),  
    panel.border = element_rect(color = "black", fill = NA, linewidth = 0.5),  
    plot.background = element_rect(color = "black", fill = NA, linewidth = 0.5)  
  )
  
  print(abundance_plot)

  # Save outputs if prefix given
  if (!is.null(save_prefix)) {
    ggsave(filename = paste0(save_prefix, "_", test_type, ".jpg"), plot = abundance_plot, width = 8, height = 5, dpi = 300)
    write.csv(abundance_summary, paste0(save_prefix, "_", test_type, "_summary.csv"), row.names = FALSE)
  }

  return(list(summary = abundance_summary, plot = abundance_plot))
}


# --------------------------------------------------------------------------------------------------
# Function: run_abundance_analysis_batch
# Purpose: Batch wrapper to apply abundance analysis across multiple branches and test types.
# --------------------------------------------------------------------------------------------------


run_abundance_analysis_batch <- function(
  cds, cell_props, coldat, trajectories, transitions,
  analysis_plan,
  test_types = c("Enrichment", "Depletion", "Two Sided Test"),
  save_dir = NULL
) {
  results <- list()
  
  # Loop over all entries in the analysis plan (each trajectory and branch)
  for (entry in analysis_plan) {
    for (test in test_types) {
      prefix <- if (!is.null(save_dir)) {
        file.path(save_dir, paste0(entry$trajectory_key, "_", entry$branch_name))
      } else {
        NULL
      }
      
      # Run Analysis
      res <- analyze_abundance_for_branch(
        cds = cds,
        cell_props = cell_props,
        coldat = coldat,
        trajectories = trajectories,
        transitions = transitions,
        trajectory_key = entry$trajectory_key,
        branch_name = entry$branch_name,
        test_type = test,
        save_prefix = prefix
      )

      results[[paste(entry$trajectory_key, entry$branch_name, test, sep = "_")]] <- res
    }
  }

  return(results)
}


# --------------------------------------------------------------------------------------------------
# Configuration: Analysis Plan
# Each list item defines a branch point and corresponding trajectory to analyze.
# --------------------------------------------------------------------------------------------------


analysis_plan <- list(
  list(trajectory_key = "sexual",  branch_name = "Branch_2"),
  list(trajectory_key = "asexual", branch_name = "Branch_1"),
  list(trajectory_key = "early_dev", branch_name = "Prebranch_2")
)

# --------------------------------------------------------------------------------------------------
# Run the Full Batch Analysis
# Results are stored in a named list indexed by trajectory/branch/test
# --------------------------------------------------------------------------------------------------

results <- run_abundance_analysis_batch(
  cds = cds,
  cell_props = cell_props,
  coldat = coldat,
  trajectories = trajectories,
  transitions = transitions,
  analysis_plan = analysis_plan,
  test_types = c("Enrichment", "Depletion", "Two Sided Test"),
  save_dir = "abundance_output"  # Set to NULL to disable saving
)

1: Yes
2: No
Yes